Deep Learning Optimizer Dynamics

INFO 526 - Fall 2024 - Project #2

A visual exploration of gradient descent optimizers, including Gradient Descent, Stochastic Gradient Descent (SGD), Momentum, AdaGrad, RMSProp, and Adam, comparing their convergence behavior, efficiency, and suitability across various optimization landscapes and real-world datasets.
Author
Affiliation

Stat Squad

School of Information, University of Arizona

Abstract

This project aims to provide an in-depth visual analysis of various gradient descent-based optimization algorithms, including Gradient Descent, Stochastic Gradient Descent (SGD), Momentum, AdaGrad, RMSProp, and Adam. By simulating these optimizers on different synthetic loss landscapes and applying them to a real-world dataset, the project will explore their distinct convergence behaviors, efficiency, and adaptability. Through dynamic visualizations and interactive tools, the project will help users understand the strengths and limitations of each optimizer, enabling better selection of optimization techniques for different machine learning tasks.

Introduction

Optimization is a fundamental component of machine learning, as it drives the process of training models by minimizing a loss function. One of the most widely used optimization techniques is gradient descent, which iteratively adjusts model parameters to reduce the error in predictions. However, the basic gradient descent algorithm has limitations, such as slow convergence and sensitivity to the choice of learning rate. To address these challenges, several optimizers have been developed to improve the efficiency, stability, and adaptability of gradient descent.

This project focuses on six popular gradient descent-based optimizers: Gradient Descent, Stochastic Gradient Descent (SGD), Momentum, AdaGrad, RMSProp, and Adam. These optimizers differ in how they adjust the learning rate, incorporate past gradients, and handle noisy data. While each optimizer has its advantages and drawbacks, understanding their behavior and performance on different types of loss functions is crucial for selecting the most appropriate method for a given machine learning task.

The goal of this project is to visually demonstrate the convergence behaviors and efficiency of these optimizers in various optimization landscapes, such as convex, non-convex, and saddle-point surfaces. Using both synthetic data and real-world examples, the project will explore how each optimizer performs across different types of loss surfaces. By generating dynamic, animated visualizations, we aim to provide a deeper understanding of how each optimizer functions, offering a valuable tool for students and practitioners to choose the right optimization technique for their machine learning models.

Approach:

1. Synthetic Data

  • Objective: Explore optimizer performance on different synthetic loss landscapes (convex, non-convex, saddle point).

    • Gradient Descent: Apply on quadratic function (convex).

    • Stochastic Gradient Descent (SGD): Apply on quadratic function, check for noisy convergence.

    • Momentum: Apply on quadratic function and Rosenbrock function to test smoother convergence.

    • AdaGrad: Apply on non-convex and saddle point functions to see how learning rate adapts.

    • RMSProp: Test on non-convex and saddle point functions for faster convergence.

    • Adam: Apply on all landscapes (quadratic, Rosenbrock, saddle point) to compare convergence stability and speed.

Exploring Optimizers on Synthetic Data

Synthetic data helps us simulate and study optimizer behavior in controlled environments where we know the true function and its minimum. This allows for a more straightforward analysis of how optimizers perform in different loss landscapes. For the synthetic data, we will generate loss surfaces for three different types: convex, non-convex, and saddle-point.

Generating Synthetic Data in R:

  • Quadratic Function (Convex Surface):

    • Formula: \[f(x, y) = x^2 + y^2\]

    • Minimum: Located at (0,0).

    • Description: A smooth, convex paraboloid that is symmetric around the origin, making it a straightforward example to observe the convergence path.

  • Rosenbrock Function (Non-Convex Surface):

    • Formula: \[f(x, y) = (a - x)^2 + b \cdot (y - x^2)^2\]

    • Parameters: Typically, a=1 and b=100.

    • Minimum: Located at (a, a^2), typically (1,1) with the default parameters.

    • Description: This non-convex function has a narrow valley that requires careful adjustment to navigate. It’s often used to highlight how optimizers handle complex landscapes.

  • Saddle Point Function (Mixed Curvature Surface):

    • Formula: \[f(x, y) = x^2 - y^2\]

    • Saddle Point: Located at (0,0).

    • Description: This function has a saddle point at the origin. It’s concave along one axis and convex along the other, providing a useful case for understanding optimizers’ sensitivity to directional changes in curvature.
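The project generates these surfaces in R; as a minimal, language-agnostic sketch (NumPy is used here purely for illustration, and the grid range and resolution are arbitrary choices), the three loss functions can be written as:

```python
import numpy as np

def quadratic(x, y):
    # Convex paraboloid; global minimum at (0, 0)
    return x**2 + y**2

def rosenbrock(x, y, a=1.0, b=100.0):
    # Non-convex narrow valley; minimum at (a, a^2) = (1, 1) with defaults
    return (a - x)**2 + b * (y - x**2)**2

def saddle(x, y):
    # Saddle point at (0, 0): convex along x, concave along y
    return x**2 - y**2

# Evaluate each surface on a grid for contour/surface plotting
xs = np.linspace(-2.0, 2.0, 101)
X, Y = np.meshgrid(xs, xs)
Z_quad, Z_rosen, Z_saddle = quadratic(X, Y), rosenbrock(X, Y), saddle(X, Y)
```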

Gradient Descent

  • Mathematical Intuition: Gradient Descent updates parameters based on the negative gradient of the loss function. For a function f(θ), the update rule is:

    \[\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)\]

    where η is the learning rate and ∇f(θ_t) is the gradient at point θ_t.

  • R Implementation:
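The R code itself is not reproduced in this excerpt. As a hedged sketch of the same update rule θ ← θ − η∇f(θ) on the quadratic surface (the starting point (4, 4) follows the text; the learning rate and step count are illustrative assumptions):

```python
import numpy as np

def grad_quadratic(theta):
    # Gradient of f(x, y) = x^2 + y^2 is (2x, 2y)
    return 2.0 * theta

def gradient_descent(theta0, lr=0.1, steps=100):
    # Plain gradient descent: theta <- theta - lr * grad(theta)
    theta = np.asarray(theta0, dtype=float)
    path = [theta.copy()]
    for _ in range(steps):
        theta = theta - lr * grad_quadratic(theta)
        path.append(theta.copy())
    return np.array(path)

# Start at (4, 4); the recorded path supplies the arrow segments for the plot
path = gradient_descent([4.0, 4.0])
```

Because the gradient is proportional to the distance from the origin, consecutive points in `path` contract geometrically, which is why the arrows shorten as the trajectory approaches (0, 0).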

  • Explanation: In this visualization, we start at the initial point (4, 4) on the surface of the quadratic function f(x, y) = x^2 + y^2. The gradient descent algorithm iteratively updates the values of x and y by moving in the direction of the steepest descent, determined by the negative gradient of the function.

  • At each step, arrows are drawn from the current position to the updated position, showing the direction and magnitude of the movement. The length of each arrow represents the step size, which decreases as we approach the minimum due to the nature of the gradient being proportional to the distance from the origin.

  • The trajectory clearly illustrates how gradient descent converges toward the global minimum at (0, 0), where the function achieves its lowest value. This process highlights how gradient descent works effectively in a smooth, convex landscape like this quadratic function.

Stochastic Gradient Descent (SGD)

  • Mathematical Intuition: In SGD, instead of using the full gradient, we compute the gradient using a single random data point. This makes the process noisy but can be more efficient in large datasets.

  • R Implementation:
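As with the other optimizers, the project implements this in R; the sketch below mimics the described setup in Python, adding Gaussian noise to the gradient to stand in for single-sample gradient estimates (the noise level, learning rate, step count, and seed are all assumptions):

```python
import numpy as np

def sgd_quadratic(theta0, lr=0.1, steps=200, noise_sd=0.5, seed=0):
    # Gradient of f(x, y) = x^2 + y^2, perturbed by Gaussian noise to
    # simulate the variance of stochastic (per-sample) gradients
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    path = [theta.copy()]
    for _ in range(steps):
        noisy_grad = 2.0 * theta + rng.normal(0.0, noise_sd, size=theta.shape)
        theta = theta - lr * noisy_grad
        path.append(theta.copy())
    return np.array(path)

path = sgd_quadratic([4.0, 4.0])
```

The trajectory drifts toward the origin but never settles exactly on it; the residual jitter around (0, 0) is the oscillation described in the explanation below.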

  • Explanation: In this visualization, we begin at the initial point (4, 4) and update the values of x and y at each step using the stochastic gradient descent algorithm. Instead of calculating the full gradient, noise is added to the gradient at each iteration to simulate a stochastic environment.

  • Arrows are used to represent the direction and magnitude of the updates, showing the noisy movement toward the global minimum at (0, 0). Unlike standard gradient descent, stochastic gradient descent introduces small oscillations due to the random noise, which can help in escaping local minima in more complex landscapes.

  • The visualization highlights the efficiency of stochastic gradient descent in large datasets where calculating the full gradient at each step may be computationally expensive. The oscillations, however, underscore the trade-off between stability and speed in optimization.

Momentum Optimizer

  • Mathematical Intuition: Momentum is an optimization technique that helps accelerate gradient descent by adding a fraction of the previous update to the current update. This reduces oscillations and helps in faster convergence.

  • Update Rule:

    \[v_t = \beta v_{t-1} + (1 - \beta) \nabla f(\theta_t)\]

    \[\theta_{t+1} = \theta_t - \eta v_t\]

    where:

    • v_t is the velocity (momentum),

    • β is the momentum coefficient (usually close to 1),

    • η is the learning rate,

    • ∇f(θ_t) is the gradient at the current step.

  • Momentum helps speed up the convergence by adding a fraction of the previous velocity to the current gradient.

  • R Implementation:
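A Python sketch of the update rule stated above (note that it keeps the (1 − β) factor exactly as written in this report; some momentum formulations omit it; hyperparameter values here are illustrative assumptions):

```python
import numpy as np

def momentum_quadratic(theta0, lr=0.1, beta=0.9, steps=300):
    # v <- beta * v + (1 - beta) * grad;  theta <- theta - lr * v
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        grad = 2.0 * theta                 # gradient of x^2 + y^2
        v = beta * v + (1.0 - beta) * grad # exponentially averaged velocity
        theta = theta - lr * v
        path.append(theta.copy())
    return np.array(path)

path = momentum_quadratic([4.0, 4.0])
```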

  • Explanation: This visualization starts at the initial point (4,4) and simulates the updates using the momentum optimization technique. At each step:

    • The gradient is calculated, and a fraction of the previous velocity is added to the current gradient update.

    • This velocity term reduces oscillations and accelerates convergence toward the global minimum.

    • Arrows indicate the direction and magnitude of updates. The momentum term helps the optimizer move smoothly and quickly, demonstrating how it handles optimization in a convex landscape.

AdaGrad Optimizer

  • Mathematical Intuition: AdaGrad adapts the learning rate for each parameter based on the historical gradients. Parameters that have frequently large gradients get smaller updates, and parameters with infrequent large gradients get larger updates.

  • Update Rule:

    \[ g_t = \nabla f(\theta_t) \]

    \[ G_t = G_{t-1} + g_t^2 \]

    \[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t \]

    where:

    • G_t is the sum of squared gradients,

    • ϵ is a small constant to prevent division by zero.

  • AdaGrad ensures that frequently updated parameters have smaller learning rates.

  • R Implementation:
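The three-equation rule above can be sketched in Python as follows (the base learning rate and step count are illustrative assumptions; AdaGrad tolerates a relatively large base rate because the accumulated G_t shrinks the effective step):

```python
import numpy as np

def adagrad_quadratic(theta0, lr=1.0, steps=500, eps=1e-8):
    # G accumulates squared gradients; the per-parameter effective
    # learning rate is lr / sqrt(G + eps), so it only ever shrinks
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = 2.0 * theta                  # gradient of x^2 + y^2
        G = G + g**2
        theta = theta - lr / np.sqrt(G + eps) * g
        path.append(theta.copy())
    return np.array(path)

path = adagrad_quadratic([4.0, 4.0])
```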

  • Explanation: We start at the initial point (4, 4) on the quadratic surface f(x, y) = x^2 + y^2. The AdaGrad optimizer updates the learning rate for each parameter adaptively, based on the accumulated squared gradients: parameters with frequently large gradients get smaller updates, while parameters with infrequent large gradients get larger updates. Arrows represent the direction and magnitude of updates at each step. The trajectory converges toward the global minimum at (0, 0), demonstrating how AdaGrad adjusts its learning rate dynamically to achieve convergence in a smooth, convex optimization landscape.

RMSProp Optimizer

  • Mathematical Intuition: RMSProp is an adaptive learning rate method that divides the learning rate by a moving average of the squared gradients. It works similarly to AdaGrad but with a smoothing term to prevent the learning rate from decaying too quickly.

  • Update Rule:

    • Compute the first moment (gradient) g_t:

      \[ g_t = \nabla f(\theta_t) \]

    • Update the second moment v_t with exponential decay:

      \[ v_t = \beta v_{t-1} + (1 - \beta) g_t^2 \]

    • Update the parameter \[\theta_{t+1}\] using the RMSProp rule:

      \[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t + \epsilon}} \cdot g_t \]

       where:

      • v_t is the moving average of squared gradients,

      • β is the smoothing constant, typically set to 0.9,

      • ϵ is a small constant for numerical stability.

    • RMSProp adapts the learning rate based on recent gradient information, helping to prevent the learning rate from shrinking too quickly.

  • R Implementation:
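A Python sketch of the rule above; it differs from the AdaGrad sketch only in replacing the ever-growing sum G_t with an exponential moving average v_t (learning rate, β, and step count are illustrative assumptions):

```python
import numpy as np

def rmsprop_quadratic(theta0, lr=0.1, beta=0.9, steps=500, eps=1e-8):
    # v is an exponential moving average of squared gradients, so old
    # gradients decay instead of accumulating forever as in AdaGrad
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    path = [theta.copy()]
    for _ in range(steps):
        g = 2.0 * theta                  # gradient of x^2 + y^2
        v = beta * v + (1.0 - beta) * g**2
        theta = theta - lr / np.sqrt(v + eps) * g
        path.append(theta.copy())
    return np.array(path)

path = rmsprop_quadratic([4.0, 4.0])
```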

  • Explanation:

    • We start at the initial point (4,4) on the quadratic surface f(x, y) = x^2 + y^2.

    • The RMSProp optimizer updates the learning rate dynamically using a moving average of the squared gradients. This ensures:

      • Gradients that occur frequently do not overly diminish the learning rate.

      • Gradients that occur rarely still allow meaningful updates.

    • The momentum-like smoothing term (β) helps prevent the learning rate from decaying too quickly, unlike AdaGrad.

    • Arrows represent the direction and magnitude of updates, showcasing the efficiency of RMSProp in converging toward the global minimum at (0,0). This visualization highlights the balance RMSProp strikes between learning rate adaptability and update stability.

Adam Optimizer

  • Mathematical Intuition: Adam (Adaptive Moment Estimation) combines the ideas of both Momentum and RMSProp. It tracks both the first moment (mean) and second moment (variance) of the gradients to adapt the learning rate for each parameter. It also uses bias correction to account for the initialization of the first and second moment estimates.

  • Update Rule:

    1. First Moment Estimate (Momentum):\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \]

      Where \[m_t\] is the first moment estimate (momentum) and \[g_t\] is the gradient at time t.

    2. Second Moment Estimate (RMSProp-like):

      \[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \]

      Where \[v_t\] is the second moment estimate (similar to RMSProp).

    3. Bias-Corrected First Moment Estimate:

      \[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}\]

      This corrects the bias introduced by the initialization of m_t.

    4. Bias-Corrected Second Moment Estimate:

      \[\hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]

      This corrects the bias introduced by the initialization of v_t.

    5. Parameter Update:

      \[\theta_{t+1} = \theta_t - \frac {\eta}{\sqrt{\hat{v}_t + \epsilon}} \hat{m}_t\]

    6. Where:

      • η is the learning rate,

      • ϵ is a small constant for numerical stability,

      • \[\hat{m}_t\] is the bias-corrected first moment estimate,

      • \[\hat{v}_t\] is the bias-corrected second moment estimate.

  • Adam is highly effective and widely used due to its ability to adapt the learning rate (based on the gradient’s first and second moments) and improve convergence for a wide variety of tasks.

  • R Implementation:
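The six-step rule above can be sketched in Python as follows, combining the momentum and RMSProp sketches and adding the bias corrections (hyperparameter values are illustrative assumptions, using the common defaults β1 = 0.9, β2 = 0.999):

```python
import numpy as np

def adam_quadratic(theta0, lr=0.1, beta1=0.9, beta2=0.999, steps=500, eps=1e-8):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)             # first moment estimate
    v = np.zeros_like(theta)             # second moment estimate
    path = [theta.copy()]
    for t in range(1, steps + 1):
        g = 2.0 * theta                  # gradient of x^2 + y^2
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)     # bias-corrected first moment
        v_hat = v / (1.0 - beta2**t)     # bias-corrected second moment
        theta = theta - lr * m_hat / np.sqrt(v_hat + eps)
        path.append(theta.copy())
    return np.array(path)

path = adam_quadratic([4.0, 4.0])
```

The bias corrections matter most in the first few iterations, when m and v are still close to their zero initialization; without them the early steps would be much smaller than intended.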

  • Explanation:

    • The optimizer starts at the initial point (4,4) on the quadratic surface f(x, y) = x^2 + y^2.

    • Adam combines the benefits of:

      • Momentum: Using the first moment estimate (m_t) to smooth updates and reduce oscillations.

      • RMSProp: Using the second moment estimate (v_t) to adapt the learning rate dynamically for each parameter.

    • Bias correction ensures that the first few updates are unbiased, despite starting with zero moments.

    • Arrows represent the direction and magnitude of updates at each step, showing the optimizer’s ability to adapt learning rates and converge efficiently to the global minimum at (0,0).

Visualizing the different optimizers’ behavior

Efficiency Comparison

  1. Speed:

    • Fastest: Adam, RMSProp, and SGD (for large datasets).

    • Slowest: Gradient Descent due to computation over the entire dataset at each step.

  2. Convergence Stability:

    • Stable: Momentum, Adam, and RMSProp are robust to oscillations.

    • Noisy: SGD can oscillate or overshoot due to randomness.

  3. Adaptability:

    • Most Adaptive: Adam (adjusts learning rate dynamically for each parameter).

    • Least Adaptive: Gradient Descent (fixed learning rate for all parameters).

  4. Handling Sparse Data:

    • Best: AdaGrad and Adam are designed for sparse gradients.

    • Worst: Gradient Descent and Momentum may struggle without careful tuning.

Conclusion

  • Adam is widely regarded as the most efficient and versatile optimizer for general use cases due to its adaptive nature, fast convergence, and stability. However, for very specific tasks, other methods like RMSProp or AdaGrad may perform better.

2D static images

Comparison Between Optimizers

| Optimizer | Description | Pros | Cons | Best Use |
|---|---|---|---|---|
| Gradient Descent | Full gradient each update | Stable convergence | Slow, costly | Small, static datasets |
| SGD | One sample/batch per update | Fast, escapes local minima | High variance, less stable | Large datasets, limited resources |
| Momentum | Adds momentum to smooth updates | Faster, reduces oscillation | Adds tuning complexity | Oscillatory paths |
| AdaGrad | Adapts learning rate per parameter | Good for sparse data | Learning rate decays | Sparse datasets |
| RMSProp | Moving avg. of squared gradients | Solves AdaGrad decay issue | Needs decay-rate tuning | Non-stationary tasks (e.g., RNNs) |
| Adam | Combines Momentum + RMSProp | Fast, robust to noise | Complex, more tuning needed | General-purpose neural networks |